UNIKN–DrafaTracing-MC3

VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease

 

 

Authors and Affiliations:

Timo Göbel, University of Konstanz, timo.goebel@uni-konstanz.de

Zdravko Monov, University of Konstanz, zdravko.monov@uni-konstanz.de

Toni Schmidt, University of Konstanz, toni.schmidt@uni-konstanz.de

Peter Bak, University of Konstanz, peter.bak@uni-konstanz.de

     

     

Tool(s):

We created a Java application for formatting data, computing frequencies, and plotting result values. The individual Levenshtein distances were computed with the LingPipe library [1]. Microsoft Excel was used for the visualizations of MC 3.1, MC 3.2 and MC 3.3. For MC 3.4 we developed a Processing [2] application that visualizes the results and lets the user sort them interactively. Developing our prototypes took approximately 40 hours.

 

 

Video:

 

Click here for the Video

 

Visualization tool:

 

Click here to launch the visualization tool

 

ANSWERS:


MC3.1: What is the region or country of origin for the current outbreak?  Please provide your answer as the name of the native viral strain along with a brief explanation.

 

The task of mini challenge 3.1 was to identify the region or country of origin for the current outbreak. Our initial hypothesis was that the native sequence of the country of origin differs less from the outbreak sequences than the native sequences of the other countries and regions. We therefore compared all outbreak sequences to the native sequences using the Levenshtein distance (Figure 1).

 

 


 

Figure 1. Schematic description of the Levenshtein distance and its computation.

 

 

We created a Java application that calculated the Levenshtein distance using the LingPipe library [1] and visualized the results in a bar chart. The implementation took approximately 5 hours.
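As an illustration of the distance computation, the following is a minimal sketch that assumes a plain dynamic-programming implementation in place of the LingPipe call; the class and method names are hypothetical and the sequences are toy data, not challenge data.

```java
final class EditDistanceSketch {

    /** Minimum number of single-character insertions, deletions and substitutions turning a into b. */
    static int levenshtein(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1]; // previous row of the DP table
        int[] curr = new int[b.length() + 1]; // row currently being filled
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // empty prefix of a -> first j characters of b: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // first i characters of a -> empty prefix of b: i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full table
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Toy sequences that are one substitution apart.
        System.out.println(levenshtein("ACGT", "AGGT")); // prints 1
    }
}
```

Keeping only two rows of the table is a design choice that limits memory use to the length of one sequence, which matters little for two sequences but helps when many pairs are compared.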

 

As shown in the following bar chart (Figure 2), the country of origin is Nigeria_B, whose native sequence requires the fewest edit operations on average (15, stdev = 1.1).

 

Figure 2. Levenshtein distance of all sequences of the current outbreak to all native sequences, the regions are shown on the x-axis, and the average number of edit operations on the y-axis.
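The selection of the origin by smallest average distance could be sketched as follows. This is an illustration under assumptions: the distance function is passed in as a parameter so that any Levenshtein implementation can be plugged in, the class and method names are hypothetical, and the strain names and sequences in `main` are toy data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntBiFunction;

final class OriginSketch {

    /** Mean distance from one native sequence to every outbreak sequence. */
    static double meanDistance(String nativeSeq, List<String> outbreak,
                               ToIntBiFunction<String, String> distance) {
        double sum = 0;
        for (String seq : outbreak) {
            sum += distance.applyAsInt(nativeSeq, seq);
        }
        return sum / outbreak.size();
    }

    /** Name of the native strain with the smallest mean distance to the outbreak sequences. */
    static String closestNative(Map<String, String> natives, List<String> outbreak,
                                ToIntBiFunction<String, String> distance) {
        String best = null;
        double bestMean = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, String> entry : natives.entrySet()) {
            double mean = meanDistance(entry.getValue(), outbreak, distance);
            if (mean < bestMean) {
                bestMean = mean;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> natives = new LinkedHashMap<>();
        natives.put("strainA", "AAAA"); // toy data
        natives.put("strainB", "CCCC");
        List<String> outbreak = Arrays.asList("AAAC", "AACC");
        // Toy distance that counts differing positions; a Levenshtein function would go here.
        ToIntBiFunction<String, String> toy = (a, b) -> {
            int d = 0;
            for (int i = 0; i < a.length(); i++) {
                if (a.charAt(i) != b.charAt(i)) d++;
            }
            return d;
        };
        System.out.println(closestNative(natives, outbreak, toy));
    }
}
```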

 


MC3.2:  Over time, the virus spreads and the diversity of the virus increases as it mutates.  Two patients infected with the Drafa virus are in the same hospital as Nicolai.  Nicolai has a strain identified by sequence 583.  One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51.  Assume only a single viral strain is in each patient.  Which patient likely contracted the illness from Nicolai and why?  Please provide your answer as the sequence number along with a brief explanation.

 

The task of mini challenge 3.2 was to identify which patient (#123 or #51) likely contracted the illness from Nicolai, and why. We made this determination with the Levenshtein distance: we expected a lower number of substitutions to indicate fewer mutations, and therefore a higher likelihood of direct infection. The data was imported into our Java application, processed using the Levenshtein distance comparison, and visualized as a table in Excel (see Table 1), which took approximately 4 hours.

 

 

 

Table 1. Comparison of Nicolai’s viral strain to the strains of patients 123 and 51 by using the Levenshtein distance. Fewer substitutions indicate higher likelihood of direct infection.

 

Patient 123 is more likely to have contracted the illness from Nicolai, since this patient's strain differs from Nicolai’s by only one substitution (A→C, 269). The strain of Patient 51 differs from Nicolai’s by three substitutions (A→C, 494; C→T, 842; T→A, 946).
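Listing the substitutions between two strains, in the notation used by the question, could look like the sketch below. It assumes that both sequences have equal length and are aligned position by position (reasonable here because the mutations are base substitutions); the class and method names are hypothetical, and an ASCII arrow is used in the output.

```java
import java.util.ArrayList;
import java.util.List;

final class SubstitutionSketch {

    /** Lists base substitutions as "X->Y, position", with positions counted from 1, left to right. */
    static List<String> substitutions(String from, String to) {
        if (from.length() != to.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < from.length(); i++) {
            if (from.charAt(i) != to.charAt(i)) {
                result.add(from.charAt(i) + "->" + to.charAt(i) + ", " + (i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy sequences, not challenge data.
        System.out.println(substitutions("ACGT", "AGGA"));
    }
}
```

The patient whose list is shorter is the more likely direct contact under the hypothesis stated above.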

 

 


MC3.3:  Signs and symptoms of the Drafa virus are varied and humans react differently to infection.  Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them. 

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic).  The mutations involve one or more base substitutions.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39)

 

In mini challenge 3.3 we had to determine the mutations leading to an increase in symptom severity. To do so, we applied a pairwise comparison of outbreak sequences. More specifically, we compared all pairs that exhibit an increase in symptom severity while all other characteristics remain unchanged. That way we isolated the substitutions relevant only for symptom severity. We calculated the relative frequency of these substitutions using a Java program, which took approximately 2 hours. The results are shown in Figure 3, in which the top 3 mutations are highlighted. The substitutions at base positions 946 and 842 are the top two mutations; those at positions 161 and 223 together form the third, since they occur with equal frequency.

 

Figure 3. The top 20 mutations that lead to an increase in symptom severity. The top mutations occurring most frequently in all the sequences of the current outbreak are highlighted in red.
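The relative-frequency calculation for a set of sequence pairs could be sketched as follows. This is an illustration, not the original program: it assumes that each pair is given as a two-element array (milder strain first, more severe strain second) of equal-length sequences, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FrequencySketch {

    /** Maps each substitution ("X->Y, position") to the fraction of pairs in which it occurs. */
    static Map<String, Double> relativeFrequencies(List<String[]> pairs) {
        Map<String, Double> freq = new TreeMap<>();
        for (String[] pair : pairs) {
            String from = pair[0];
            String to = pair[1];
            for (int i = 0; i < from.length(); i++) {
                if (from.charAt(i) != to.charAt(i)) {
                    String key = from.charAt(i) + "->" + to.charAt(i) + ", " + (i + 1);
                    // Each occurrence contributes one share of 1 / (number of pairs).
                    freq.merge(key, 1.0 / pairs.size(), Double::sum);
                }
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // Toy pairs, not challenge data.
        List<String[]> pairs = Arrays.asList(
                new String[] {"ACGT", "AGGT"},
                new String[] {"ACGT", "AGGA"});
        System.out.println(relativeFrequencies(pairs));
    }
}
```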

 

 


MC3.4:  Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question.  To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important.  Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions.  In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39).

 

The task of mini-challenge 3.4 was to identify the top three mutations that lead to the most dangerous viral strains. Our concept relies on pairwise sequence comparison of selected patients. The main point in this task is to isolate the base substitutions that lead to the worst state of each property: severe for symptoms, high for mortality, and so on.

 

There are two ways to generate pairs. The first procedure searches the dataset for pairs in which only one disease characteristic changes to its worst state, while the other characteristics remain the same. This approach allows us to track individual base substitutions. We call this procedure "exclusive-pairs". In a preliminary analysis we noticed that the resulting pair set may not be large enough for a reliable and exhaustive solution. The second way is to use all pairs in which one disease characteristic changes to its worst state, ignoring the other characteristics. We call this approach "greedy-pairs". This model led to a larger dataset that includes alternative answers.
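The two pair-generation procedures could be sketched as below. This is a sketch under assumptions: each disease characteristic is encoded as an integer level where a higher value is worse, the `Patient` class and the `pairs` method are hypothetical names, and a single flag switches between the exclusive and the greedy variant.

```java
import java.util.ArrayList;
import java.util.List;

final class PairSketch {

    /** Hypothetical container: one outbreak sequence plus its characteristic levels (higher = worse). */
    static final class Patient {
        final String sequence;
        final int[] levels;

        Patient(String sequence, int... levels) {
            this.sequence = sequence;
            this.levels = levels;
        }
    }

    /**
     * Ordered pairs (a, b) in which characteristic c rises from a non-worst level in a
     * to its worst level in b. With exclusive = true, all other characteristics must be
     * identical in a and b; with exclusive = false (greedy), they are ignored.
     */
    static List<Patient[]> pairs(List<Patient> patients, int c, int worst, boolean exclusive) {
        List<Patient[]> result = new ArrayList<>();
        for (Patient a : patients) {
            for (Patient b : patients) {
                if (a.levels[c] >= worst || b.levels[c] != worst) {
                    continue; // characteristic c does not change to its worst state
                }
                if (exclusive && !othersEqual(a, b, c)) {
                    continue; // another characteristic changed as well
                }
                result.add(new Patient[] {a, b});
            }
        }
        return result;
    }

    private static boolean othersEqual(Patient a, Patient b, int c) {
        for (int k = 0; k < a.levels.length; k++) {
            if (k != c && a.levels[k] != b.levels[k]) {
                return false;
            }
        }
        return true;
    }
}
```

By construction the exclusive pair set is a subset of the greedy pair set, which matches the observation that the greedy variant yields more pairs.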

 

Once we had determined the base substitutions for each disease characteristic, we applied an intersection operation. The substitutions common to all characteristics yield the desired mutations.

We propose a ranking that orders substitutions according to their frequency of occurrence. Our idea was to count the occurrences of each base substitution and then calculate the frequency relative to the number of patient pairs. Finally, we use the weighted average of these frequencies across all disease characteristics as the comparison factor.
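The intersection and ranking steps could be sketched as follows. Since the weights of the weighted average are not stated, this sketch assumes equal weights (a plain mean), in line with treating every characteristic as equally important; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RankingSketch {

    /**
     * Substitutions that occur for every characteristic, ordered by their mean
     * relative frequency (highest first, ties broken alphabetically).
     * Each map holds substitution -> relative frequency for one characteristic.
     */
    static List<String> rank(Collection<Map<String, Double>> perCharacteristic) {
        Set<String> common = null;
        for (Map<String, Double> freq : perCharacteristic) {
            if (common == null) {
                common = new HashSet<>(freq.keySet());
            } else {
                common.retainAll(freq.keySet()); // intersection across characteristics
            }
        }
        if (common == null) {
            return new ArrayList<>();
        }
        List<String> ranked = new ArrayList<>(common);
        ranked.sort((x, y) -> {
            int byMean = Double.compare(mean(y, perCharacteristic), mean(x, perCharacteristic));
            return byMean != 0 ? byMean : x.compareTo(y);
        });
        return ranked;
    }

    /** Equal-weight average of a substitution's frequency over all characteristics (an assumption). */
    static double mean(String substitution, Collection<Map<String, Double>> perCharacteristic) {
        double sum = 0;
        for (Map<String, Double> freq : perCharacteristic) {
            sum += freq.getOrDefault(substitution, 0.0);
        }
        return sum / perCharacteristic.size();
    }
}
```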

 

We extended our Java application to handle all disease characteristics; the implementation took approximately 30 hours. The algorithm generates patient pairs using either the exclusive-pairs or the greedy-pairs procedure. For symptoms, for example, pairs that go from mild to severe or from moderate to severe are relevant for the final result, whereas pairs that go from mild to moderate are discarded. For each pair, the program determines the base substitutions and records their relative frequency. The union over all pairs yields the set of base substitutions for the corresponding disease characteristic. This process is repeated for all disease characteristics.

 

We then visualized our results using a separate Java application, which employs the Processing graphics library [2]. This visualization plots the base substitutions for each disease characteristic as well as the corresponding weighted average. We only considered base positions at which at least one mutation appears, so that the visualized data is reduced to its relevant segments. Table 2 shows a typical output from our Processing tool. A cell indicates an individual base substitution; its label is located in the column header. On the left of each line, a label indicates the corresponding disease characteristic. For example, "To Severe Symptoms" means that this line shows mutations leading to severe symptoms. The number in brackets shows how many pairs were generated to extract this information. A white cell means that no base substitution occurred at that position for the corresponding disease characteristic.

 

For each base substitution we show the relative frequency with respect to the total number of pairs. This number is shown in each cell, and the color intensity of the cell scales proportionally with it, so the viewer can quickly compare frequency values.
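One way to map a relative frequency to a cell shade is sketched below. The linear mapping and the 0 to 255 gray scale are assumptions made for illustration (white for a frequency of 0, black for a frequency of 1); the names are hypothetical.

```java
final class ShadeSketch {

    /** Gray level in 0..255 for a relative frequency in [0, 1]; 255 is white, 0 is black. */
    static int grayLevel(double frequency) {
        double f = Math.max(0.0, Math.min(1.0, frequency)); // clamp to the valid range
        return (int) Math.round(255 * (1.0 - f));
    }

    public static void main(String[] args) {
        System.out.println(grayLevel(0.0)); // prints 255
        System.out.println(grayLevel(1.0)); // prints 0
    }
}
```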

 


 

Table 2. Output from the Processing tool: the grayscale color of each cell is an indicator of the frequency of the base substitution. The bottom line provides a weighted summary for each base by computing the average frequency of substitutions.

 

The bottom line shows the weighted average of the frequency of each base change. Since we wanted to extract base substitutions that worsen all disease characteristics, we only plot frequency values into cells corresponding to a substitution that worsens all disease characteristics. The application offers the possibility to sort each line by frequency. Table 3 shows an example of this case. This way the viewer can quickly see which mutations lead to the worst disease characteristics.

 

 


 

Table 3. Base substitutions sorted according to their positions

 

 

Each disease characteristic can also be sorted according to the frequency of its base substitutions: the user clicks on the label of the disease characteristic on the left side. Table 4 depicts the situation in which the user has sorted the mutations leading to High Mortality, and thus sees directly which substitutions are most influential.

 

 


 

Table 4. Base substitutions sorted according to High Mortality influence

 

 

Finally, the bottom line, Weighted Average, can be sorted according to the overall frequency of the mutations. Using our Processing visualization tool, we were able to quickly determine the mutations leading to the worst virus.

 

Table 5 displays the summary of our task for the exclusive-pairs model. The weighted average line shows no numbers for the top 3 bases, which means that there is no single substitution that affects all disease characteristics. Based on the weighted average of the substitutions with the highest frequencies, we arrive at the following top three mutations:

 

1. T→C, 842

2. A→T, 946

3. G→C, 161.

 

 


 

Table 5. Results using the “exclusive-pairs” approach, displayed in the red rectangle. There is no overlap across all disease characteristics when exclusive-pairs are used. The bottom cells are empty since we only plot frequency values into cells that affect all disease characteristics.

 

 

However, these results reflect only a subset of the total number of pairs. Using greedy-pairs, our results change slightly: there are now substitutions that affect all disease characteristics. We consider the property of affecting all characteristics more relevant than the absolute weighted averages of the frequencies. Substitutions affecting all characteristics may not occur with the highest absolute frequency, yet only they indicate a mutation to a virus that alters all disease characteristics to their worst value. Following this policy, our final result, as shown in Table 6, is:

 

1*. G→C, 161

2*. G→C, 22

3*. C→A, 79.

 


 

Table 6. End results using the "greedy-pairs" generation approach. The red rectangle encompasses the top 3 base substitutions with the highest occurrence frequency. The blue rectangle, in contrast, highlights the substitutions that affect all disease characteristics.

 

 

References

 

[1] LingPipe text processing toolkit for Java. http://alias-i.com/lingpipe/index.html.

[2] Processing, an open source programming language for creating images, animations and interactions. http://processing.org.